import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, cross_validate, RandomizedSearchCV
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
Data Set Information:
Many variables are included so that algorithms that select or learn weights for attributes can be tested. However, clearly unrelated attributes were not included; attributes were picked if there was any plausible connection to crime (N=122), plus the attribute to be predicted (Per Capita Violent Crimes). The variables included in the dataset involve the community (such as the percent of the population considered urban and the median family income) and law enforcement (such as the per capita number of police officers and the percent of officers assigned to drug units).
The per capita violent crimes variable was calculated using population and the sum of crime variables considered violent crimes in the United States: murder, rape, robbery, and assault. There was apparently some controversy in some states concerning the counting of rapes. These resulted in missing values for rape, which resulted in incorrect values for per capita violent crime. These cities are not included in the dataset. Many of these omitted communities were from the midwestern USA.
Data is described below based on original values. All numeric data was normalized into the decimal range 0.00-1.00 using an unsupervised, equal-interval binning method. Attributes retain their distribution and skew (hence, for example, the population attribute has a mean value of 0.06 because most communities are small). For example, an attribute described as 'mean people per household' is actually the normalized (0-1) version of that value.
The normalization preserves rough ratios of values WITHIN an attribute (e.g. double the value for double the population, within the available precision), except for extreme values: all values more than 3 SD above the mean are normalized to 1.00, and all values more than 3 SD below the mean are normalized to 0.00.
However, the normalization does not preserve relationships between values BETWEEN attributes (e.g. it would not be meaningful to compare the value for whitePerCap with the value for blackPerCap for a community)
A limitation was that the LEMAS survey was of the police departments with at least 100 officers, plus a random sample of smaller departments. For our purposes, communities not found in both census and crime datasets were omitted. Many communities are missing LEMAS data.
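The 3-SD clipping described above can be sketched as follows. This is an illustrative reconstruction of the described scheme (an assumption, not the original preprocessing code):

```python
import numpy as np

def equal_interval_normalize(x):
    # Clip values beyond 3 standard deviations of the mean, then rescale
    # linearly into [0, 1]. Extreme highs map to 1.00, extreme lows to 0.00.
    x = np.asarray(x, dtype=float)
    lo, hi = x.mean() - 3 * x.std(), x.mean() + 3 * x.std()
    clipped = np.clip(x, lo, hi)
    return (clipped - lo) / (hi - lo)

# A skewed "population"-like attribute keeps its ordering and rough
# within-attribute ratios after scaling.
pop = np.array([1_000, 2_000, 3_000, 5_000, 8_000, 1_000_000], dtype=float)
scaled = equal_interval_normalize(pop)
```

Because each attribute is scaled by its own mean and SD, the resulting values are only comparable within an attribute, not between attributes.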
Attribute Information: (122 predictive, 5 non-predictive, 1 goal)
-- state: US state (by number) - not counted as predictive above, but if considered, should be treated as nominal (nominal)
-- county: numeric code for county - not predictive, and many missing values (numeric)
-- community: numeric code for community - not predictive and many missing values (numeric)
-- communityname: community name - not predictive - for information only (string)
-- fold: fold number for non-random 10 fold cross validation, potentially useful for debugging, paired tests - not predictive (numeric)
-- population: population for community: (numeric - decimal)
-- householdsize: mean people per household (numeric - decimal)
-- racepctblack: percentage of population that is african american (numeric - decimal)
-- racePctWhite: percentage of population that is caucasian (numeric - decimal)
-- racePctAsian: percentage of population that is of asian heritage (numeric - decimal)
-- racePctHisp: percentage of population that is of hispanic heritage (numeric - decimal)
-- agePct12t21: percentage of population that is 12-21 in age (numeric - decimal)
-- agePct12t29: percentage of population that is 12-29 in age (numeric - decimal)
-- agePct16t24: percentage of population that is 16-24 in age (numeric - decimal)
-- agePct65up: percentage of population that is 65 and over in age (numeric - decimal)
-- numbUrban: number of people living in areas classified as urban (numeric - decimal)
-- pctUrban: percentage of people living in areas classified as urban (numeric - decimal)
-- medIncome: median household income (numeric - decimal)
-- pctWWage: percentage of households with wage or salary income in 1989 (numeric - decimal)
-- pctWFarmSelf: percentage of households with farm or self employment income in 1989 (numeric - decimal)
-- pctWInvInc: percentage of households with investment / rent income in 1989 (numeric - decimal)
-- pctWSocSec: percentage of households with social security income in 1989 (numeric - decimal)
-- pctWPubAsst: percentage of households with public assistance income in 1989 (numeric - decimal)
-- pctWRetire: percentage of households with retirement income in 1989 (numeric - decimal)
-- medFamInc: median family income (differs from household income for non-family households) (numeric - decimal)
-- perCapInc: per capita income (numeric - decimal)
-- whitePerCap: per capita income for caucasians (numeric - decimal)
-- blackPerCap: per capita income for african americans (numeric - decimal)
-- indianPerCap: per capita income for native americans (numeric - decimal)
-- AsianPerCap: per capita income for people with asian heritage (numeric - decimal)
-- OtherPerCap: per capita income for people with 'other' heritage (numeric - decimal)
-- HispPerCap: per capita income for people with hispanic heritage (numeric - decimal)
-- NumUnderPov: number of people under the poverty level (numeric - decimal)
-- PctPopUnderPov: percentage of people under the poverty level (numeric - decimal)
-- PctLess9thGrade: percentage of people 25 and over with less than a 9th grade education (numeric - decimal)
-- PctNotHSGrad: percentage of people 25 and over that are not high school graduates (numeric - decimal)
-- PctBSorMore: percentage of people 25 and over with a bachelors degree or higher education (numeric - decimal)
-- PctUnemployed: percentage of people 16 and over, in the labor force, and unemployed (numeric - decimal)
-- PctEmploy: percentage of people 16 and over who are employed (numeric - decimal)
-- PctEmplManu: percentage of people 16 and over who are employed in manufacturing (numeric - decimal)
-- PctEmplProfServ: percentage of people 16 and over who are employed in professional services (numeric - decimal)
-- PctOccupManu: percentage of people 16 and over who are employed in manufacturing (numeric - decimal)
-- PctOccupMgmtProf: percentage of people 16 and over who are employed in management or professional occupations (numeric - decimal)
-- MalePctDivorce: percentage of males who are divorced (numeric - decimal)
-- MalePctNevMarr: percentage of males who have never married (numeric - decimal)
-- FemalePctDiv: percentage of females who are divorced (numeric - decimal)
-- TotalPctDiv: percentage of population who are divorced (numeric - decimal)
-- PersPerFam: mean number of people per family (numeric - decimal)
-- PctFam2Par: percentage of families (with kids) that are headed by two parents (numeric - decimal)
-- PctKids2Par: percentage of kids in family housing with two parents (numeric - decimal)
-- PctYoungKids2Par: percent of kids 4 and under in two parent households (numeric - decimal)
-- PctTeen2Par: percent of kids age 12-17 in two parent households (numeric - decimal)
-- PctWorkMomYoungKids: percentage of moms of kids 6 and under in labor force (numeric - decimal)
-- PctWorkMom: percentage of moms of kids under 18 in labor force (numeric - decimal)
-- NumIlleg: number of kids born to never-married parents (numeric - decimal)
-- PctIlleg: percentage of kids born to never-married parents (numeric - decimal)
-- NumImmig: total number of people known to be foreign born (numeric - decimal)
-- PctImmigRecent: percentage of immigrants who immigrated within last 3 years (numeric - decimal)
-- PctImmigRec5: percentage of immigrants who immigrated within last 5 years (numeric - decimal)
-- PctImmigRec8: percentage of immigrants who immigrated within last 8 years (numeric - decimal)
-- PctImmigRec10: percentage of immigrants who immigrated within last 10 years (numeric - decimal)
-- PctRecentImmig: percent of population who have immigrated within the last 3 years (numeric - decimal)
-- PctRecImmig5: percent of population who have immigrated within the last 5 years (numeric - decimal)
-- PctRecImmig8: percent of population who have immigrated within the last 8 years (numeric - decimal)
-- PctRecImmig10: percent of population who have immigrated within the last 10 years (numeric - decimal)
-- PctSpeakEnglOnly: percent of people who speak only English (numeric - decimal)
-- PctNotSpeakEnglWell: percent of people who do not speak English well (numeric - decimal)
-- PctLargHouseFam: percent of family households that are large (6 or more) (numeric - decimal)
-- PctLargHouseOccup: percent of all occupied households that are large (6 or more people) (numeric - decimal)
-- PersPerOccupHous: mean persons per household (numeric - decimal)
-- PersPerOwnOccHous: mean persons per owner occupied household (numeric - decimal)
-- PersPerRentOccHous: mean persons per rental household (numeric - decimal)
-- PctPersOwnOccup: percent of people in owner occupied households (numeric - decimal)
-- PctPersDenseHous: percent of persons in dense housing (more than 1 person per room) (numeric - decimal)
-- PctHousLess3BR: percent of housing units with less than 3 bedrooms (numeric - decimal)
-- MedNumBR: median number of bedrooms (numeric - decimal)
-- HousVacant: number of vacant households (numeric - decimal)
-- PctHousOccup: percent of housing occupied (numeric - decimal)
-- PctHousOwnOcc: percent of households owner occupied (numeric - decimal)
-- PctVacantBoarded: percent of vacant housing that is boarded up (numeric - decimal)
-- PctVacMore6Mos: percent of vacant housing that has been vacant more than 6 months (numeric - decimal)
-- MedYrHousBuilt: median year housing units built (numeric - decimal)
-- PctHousNoPhone: percent of occupied housing units without phone (in 1990, this was rare!) (numeric - decimal)
-- PctWOFullPlumb: percent of housing without complete plumbing facilities (numeric - decimal)
-- OwnOccLowQuart: owner occupied housing - lower quartile value (numeric - decimal)
-- OwnOccMedVal: owner occupied housing - median value (numeric - decimal)
-- OwnOccHiQuart: owner occupied housing - upper quartile value (numeric - decimal)
-- RentLowQ: rental housing - lower quartile rent (numeric - decimal)
-- RentMedian: rental housing - median rent (Census variable H32B from file STF1A) (numeric - decimal)
-- RentHighQ: rental housing - upper quartile rent (numeric - decimal)
-- MedRent: median gross rent (Census variable H43A from file STF3A - includes utilities) (numeric - decimal)
-- MedRentPctHousInc: median gross rent as a percentage of household income (numeric - decimal)
-- MedOwnCostPctInc: median owners cost as a percentage of household income - for owners with a mortgage (numeric - decimal)
-- MedOwnCostPctIncNoMtg: median owners cost as a percentage of household income - for owners without a mortgage (numeric - decimal)
-- NumInShelters: number of people in homeless shelters (numeric - decimal)
-- NumStreet: number of homeless people counted in the street (numeric - decimal)
-- PctForeignBorn: percent of people foreign born (numeric - decimal)
-- PctBornSameState: percent of people born in the same state as currently living (numeric - decimal)
-- PctSameHouse85: percent of people living in the same house as in 1985 (5 years before) (numeric - decimal)
-- PctSameCity85: percent of people living in the same city as in 1985 (5 years before) (numeric - decimal)
-- PctSameState85: percent of people living in the same state as in 1985 (5 years before) (numeric - decimal)
-- LemasSwornFT: number of sworn full time police officers (numeric - decimal)
-- LemasSwFTPerPop: sworn full time police officers per 100K population (numeric - decimal)
-- LemasSwFTFieldOps: number of sworn full time police officers in field operations (on the street as opposed to administrative etc) (numeric - decimal)
-- LemasSwFTFieldPerPop: sworn full time police officers in field operations (on the street as opposed to administrative etc) per 100K population (numeric - decimal)
-- LemasTotalReq: total requests for police (numeric - decimal)
-- LemasTotReqPerPop: total requests for police per 100K population (numeric - decimal)
-- PolicReqPerOffic: total requests for police per police officer (numeric - decimal)
-- PolicPerPop: police officers per 100K population (numeric - decimal)
-- RacialMatchCommPol: a measure of the racial match between the community and the police force. High values indicate proportions in community and police force are similar (numeric - decimal)
-- PctPolicWhite: percent of police that are caucasian (numeric - decimal)
-- PctPolicBlack: percent of police that are african american (numeric - decimal)
-- PctPolicHisp: percent of police that are hispanic (numeric - decimal)
-- PctPolicAsian: percent of police that are asian (numeric - decimal)
-- PctPolicMinor: percent of police that are minority of any kind (numeric - decimal)
-- OfficAssgnDrugUnits: number of officers assigned to special drug units (numeric - decimal)
-- NumKindsDrugsSeiz: number of different kinds of drugs seized (numeric - decimal)
-- PolicAveOTWorked: police average overtime worked (numeric - decimal)
-- LandArea: land area in square miles (numeric - decimal)
-- PopDens: population density in persons per square mile (numeric - decimal)
-- PctUsePubTrans: percent of people using public transit for commuting (numeric - decimal)
-- PolicCars: number of police cars (numeric - decimal)
-- PolicOperBudg: police operating budget (numeric - decimal)
-- LemasPctPolicOnPatr: percent of sworn full time police officers on patrol (numeric - decimal)
-- LemasGangUnitDeploy: gang unit deployed (numeric - decimal - but really ordinal - 0 means NO, 1 means YES, 0.5 means Part Time)
-- LemasPctOfficDrugUn: percent of officers assigned to drug units (numeric - decimal)
-- PolicBudgPerPop: police operating budget per population (numeric - decimal)
-- ViolentCrimesPerPop: total number of violent crimes per 100K population (numeric - decimal) GOAL attribute (to be predicted)
df = pd.read_csv("communities.data",header = None)
column_names = ['state',
'county',
'community',
'communityname',
'fold',
'population',
'householdsize',
'racepctblack',
'racePctWhite',
'racePctAsian',
'racePctHisp',
'agePct12t21',
'agePct12t29',
'agePct16t24',
'agePct65up',
'numbUrban',
'pctUrban',
'medIncome',
'pctWWage',
'pctWFarmSelf',
'pctWInvInc',
'pctWSocSec',
'pctWPubAsst',
'pctWRetire',
'medFamInc',
'perCapInc',
'whitePerCap',
'blackPerCap',
'indianPerCap',
'AsianPerCap',
'OtherPerCap',
'HispPerCap',
'NumUnderPov',
'PctPopUnderPov',
'PctLess9thGrade',
'PctNotHSGrad',
'PctBSorMore',
'PctUnemployed',
'PctEmploy',
'PctEmplManu',
'PctEmplProfServ',
'PctOccupManu',
'PctOccupMgmtProf',
'MalePctDivorce',
'MalePctNevMarr',
'FemalePctDiv',
'TotalPctDiv',
'PersPerFam',
'PctFam2Par',
'PctKids2Par',
'PctYoungKids2Par',
'PctTeen2Par',
'PctWorkMomYoungKids',
'PctWorkMom',
'NumIlleg',
'PctIlleg',
'NumImmig',
'PctImmigRecent',
'PctImmigRec5',
'PctImmigRec8',
'PctImmigRec10',
'PctRecentImmig',
'PctRecImmig5',
'PctRecImmig8',
'PctRecImmig10',
'PctSpeakEnglOnly',
'PctNotSpeakEnglWell',
'PctLargHouseFam',
'PctLargHouseOccup',
'PersPerOccupHous',
'PersPerOwnOccHous',
'PersPerRentOccHous',
'PctPersOwnOccup',
'PctPersDenseHous',
'PctHousLess3BR',
'MedNumBR',
'HousVacant',
'PctHousOccup',
'PctHousOwnOcc',
'PctVacantBoarded',
'PctVacMore6Mos',
'MedYrHousBuilt',
'PctHousNoPhone',
'PctWOFullPlumb',
'OwnOccLowQuart',
'OwnOccMedVal',
'OwnOccHiQuart',
'RentLowQ',
'RentMedian',
'RentHighQ',
'MedRent',
'MedRentPctHousInc',
'MedOwnCostPctInc',
'MedOwnCostPctIncNoMtg',
'NumInShelters',
'NumStreet',
'PctForeignBorn',
'PctBornSameState',
'PctSameHouse85',
'PctSameCity85',
'PctSameState85',
'LemasSwornFT',
'LemasSwFTPerPop',
'LemasSwFTFieldOps',
'LemasSwFTFieldPerPop',
'LemasTotalReq',
'LemasTotReqPerPop',
'PolicReqPerOffic',
'PolicPerPop',
'RacialMatchCommPol',
'PctPolicWhite',
'PctPolicBlack',
'PctPolicHisp',
'PctPolicAsian',
'PctPolicMinor',
'OfficAssgnDrugUnits',
'NumKindsDrugsSeiz',
'PolicAveOTWorked',
'LandArea',
'PopDens',
'PctUsePubTrans',
'PolicCars',
'PolicOperBudg',
'LemasPctPolicOnPatr',
'LemasGangUnitDeploy',
'LemasPctOfficDrugUn',
'PolicBudgPerPop',
'ViolentCrimesPerPop']
df.columns = column_names
df
| state | county | community | communityname | fold | population | householdsize | racepctblack | racePctWhite | racePctAsian | ... | LandArea | PopDens | PctUsePubTrans | PolicCars | PolicOperBudg | LemasPctPolicOnPatr | LemasGangUnitDeploy | LemasPctOfficDrugUn | PolicBudgPerPop | ViolentCrimesPerPop | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | ? | ? | Lakewoodcity | 1 | 0.19 | 0.33 | 0.02 | 0.90 | 0.12 | ... | 0.12 | 0.26 | 0.20 | 0.06 | 0.04 | 0.9 | 0.5 | 0.32 | 0.14 | 0.20 |
| 1 | 53 | ? | ? | Tukwilacity | 1 | 0.00 | 0.16 | 0.12 | 0.74 | 0.45 | ... | 0.02 | 0.12 | 0.45 | ? | ? | ? | ? | 0.00 | ? | 0.67 |
| 2 | 24 | ? | ? | Aberdeentown | 1 | 0.00 | 0.42 | 0.49 | 0.56 | 0.17 | ... | 0.01 | 0.21 | 0.02 | ? | ? | ? | ? | 0.00 | ? | 0.43 |
| 3 | 34 | 5 | 81440 | Willingborotownship | 1 | 0.04 | 0.77 | 1.00 | 0.08 | 0.12 | ... | 0.02 | 0.39 | 0.28 | ? | ? | ? | ? | 0.00 | ? | 0.12 |
| 4 | 42 | 95 | 6096 | Bethlehemtownship | 1 | 0.01 | 0.55 | 0.02 | 0.95 | 0.09 | ... | 0.04 | 0.09 | 0.02 | ? | ? | ? | ? | 0.00 | ? | 0.03 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1989 | 12 | ? | ? | TempleTerracecity | 10 | 0.01 | 0.40 | 0.10 | 0.87 | 0.12 | ... | 0.01 | 0.28 | 0.05 | ? | ? | ? | ? | 0.00 | ? | 0.09 |
| 1990 | 6 | ? | ? | Seasidecity | 10 | 0.05 | 0.96 | 0.46 | 0.28 | 0.83 | ... | 0.02 | 0.37 | 0.20 | ? | ? | ? | ? | 0.00 | ? | 0.45 |
| 1991 | 9 | 9 | 80070 | Waterburytown | 10 | 0.16 | 0.37 | 0.25 | 0.69 | 0.04 | ... | 0.08 | 0.32 | 0.18 | 0.08 | 0.06 | 0.78 | 0 | 0.91 | 0.28 | 0.23 |
| 1992 | 25 | 17 | 72600 | Walthamcity | 10 | 0.08 | 0.51 | 0.06 | 0.87 | 0.22 | ... | 0.03 | 0.38 | 0.33 | 0.02 | 0.02 | 0.79 | 0 | 0.22 | 0.18 | 0.19 |
| 1993 | 6 | ? | ? | Ontariocity | 10 | 0.20 | 0.78 | 0.14 | 0.46 | 0.24 | ... | 0.11 | 0.30 | 0.05 | 0.08 | 0.04 | 0.73 | 0.5 | 1.00 | 0.13 | 0.48 |
1994 rows × 128 columns
Removing the first 5 columns as they are non-predictive and have many missing values
df = df.iloc[:,5:]
df[df == "?"].dropna(how= 'all')
| population | householdsize | racepctblack | racePctWhite | racePctAsian | racePctHisp | agePct12t21 | agePct12t29 | agePct16t24 | agePct65up | ... | LandArea | PopDens | PctUsePubTrans | PolicCars | PolicOperBudg | LemasPctPolicOnPatr | LemasGangUnitDeploy | LemasPctOfficDrugUn | PolicBudgPerPop | ViolentCrimesPerPop | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | ? | ? | ? | ? | NaN | ? | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | ? | ? | ? | ? | NaN | ? | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | ? | ? | ? | ? | NaN | ? | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | ? | ? | ? | ? | NaN | ? | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | ? | ? | ? | ? | NaN | ? | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1986 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | ? | ? | ? | ? | NaN | ? | NaN |
| 1987 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | ? | ? | ? | ? | NaN | ? | NaN |
| 1988 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | ? | ? | ? | ? | NaN | ? | NaN |
| 1989 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | ? | ? | ? | ? | NaN | ? | NaN |
| 1990 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | ? | ? | ? | ? | NaN | ? | NaN |
1675 rows × 123 columns
It seems that 1675 instances have '?' in certain columns. Let's drop the columns that contain them.
df = df[df != "?"].dropna(axis=1)
df
| population | householdsize | racepctblack | racePctWhite | racePctAsian | racePctHisp | agePct12t21 | agePct12t29 | agePct16t24 | agePct65up | ... | PctForeignBorn | PctBornSameState | PctSameHouse85 | PctSameCity85 | PctSameState85 | LandArea | PopDens | PctUsePubTrans | LemasPctOfficDrugUn | ViolentCrimesPerPop | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.19 | 0.33 | 0.02 | 0.90 | 0.12 | 0.17 | 0.34 | 0.47 | 0.29 | 0.32 | ... | 0.12 | 0.42 | 0.50 | 0.51 | 0.64 | 0.12 | 0.26 | 0.20 | 0.32 | 0.20 |
| 1 | 0.00 | 0.16 | 0.12 | 0.74 | 0.45 | 0.07 | 0.26 | 0.59 | 0.35 | 0.27 | ... | 0.21 | 0.50 | 0.34 | 0.60 | 0.52 | 0.02 | 0.12 | 0.45 | 0.00 | 0.67 |
| 2 | 0.00 | 0.42 | 0.49 | 0.56 | 0.17 | 0.04 | 0.39 | 0.47 | 0.28 | 0.32 | ... | 0.14 | 0.49 | 0.54 | 0.67 | 0.56 | 0.01 | 0.21 | 0.02 | 0.00 | 0.43 |
| 3 | 0.04 | 0.77 | 1.00 | 0.08 | 0.12 | 0.10 | 0.51 | 0.50 | 0.34 | 0.21 | ... | 0.19 | 0.30 | 0.73 | 0.64 | 0.65 | 0.02 | 0.39 | 0.28 | 0.00 | 0.12 |
| 4 | 0.01 | 0.55 | 0.02 | 0.95 | 0.09 | 0.05 | 0.38 | 0.38 | 0.23 | 0.36 | ... | 0.11 | 0.72 | 0.64 | 0.61 | 0.53 | 0.04 | 0.09 | 0.02 | 0.00 | 0.03 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1989 | 0.01 | 0.40 | 0.10 | 0.87 | 0.12 | 0.16 | 0.43 | 0.51 | 0.35 | 0.30 | ... | 0.22 | 0.28 | 0.34 | 0.48 | 0.39 | 0.01 | 0.28 | 0.05 | 0.00 | 0.09 |
| 1990 | 0.05 | 0.96 | 0.46 | 0.28 | 0.83 | 0.32 | 0.69 | 0.86 | 0.73 | 0.14 | ... | 0.53 | 0.25 | 0.17 | 0.10 | 0.00 | 0.02 | 0.37 | 0.20 | 0.00 | 0.45 |
| 1991 | 0.16 | 0.37 | 0.25 | 0.69 | 0.04 | 0.25 | 0.35 | 0.50 | 0.31 | 0.54 | ... | 0.25 | 0.68 | 0.61 | 0.79 | 0.76 | 0.08 | 0.32 | 0.18 | 0.91 | 0.23 |
| 1992 | 0.08 | 0.51 | 0.06 | 0.87 | 0.22 | 0.10 | 0.58 | 0.74 | 0.63 | 0.41 | ... | 0.45 | 0.64 | 0.54 | 0.59 | 0.52 | 0.03 | 0.38 | 0.33 | 0.22 | 0.19 |
| 1993 | 0.20 | 0.78 | 0.14 | 0.46 | 0.24 | 0.77 | 0.50 | 0.62 | 0.40 | 0.17 | ... | 0.68 | 0.50 | 0.34 | 0.35 | 0.68 | 0.11 | 0.30 | 0.05 | 1.00 | 0.48 |
1994 rows × 100 columns
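As an aside, the '?' markers could also be converted to NaN at load time with the `na_values` parameter of `read_csv`, avoiding the string comparison afterwards. A small sketch (the column names here are a hypothetical subset, not the real 128-column list):

```python
import io
import pandas as pd

# Tiny stand-in for communities.data; "?" marks missing values as in the
# real file.
raw = io.StringIO("8,?,Lakewoodcity,0.19\n53,?,Tukwilacity,0.00\n")
demo = pd.read_csv(raw, header=None,
                   names=["state", "county", "communityname", "population"],
                   na_values="?")
n_missing = int(demo["county"].isna().sum())  # both county values are "?"
```

With NaNs in place, the mostly-missing columns could be dropped with a plain `dropna(axis=1)`, as done above.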
ProfileReport(df,minimal=True)
The remaining columns seem okay, but we can further check the correlations among them.
plt.figure(figsize=(16,12))
sns.heatmap(df.corr(),cmap= "YlGnBu")
<AxesSubplot:>
There appear to be some highly correlated columns, but not many. Let's try to analyze this problem as is.
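To make "highly correlated" concrete, a helper like the following can list the column pairs above a threshold. This is a sketch rather than part of the original analysis, shown on synthetic data so it runs standalone (it would apply to `df` the same way):

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, threshold=0.95):
    # Absolute correlations from the upper triangle only, so each pair of
    # columns is reported once.
    corr = frame.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic demo: "b" is a near-copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({"a": a,
                     "b": a + rng.normal(scale=0.01, size=200),
                     "c": rng.normal(size=200)})
pairs = top_correlated_pairs(demo)
```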
y = np.asarray(df["ViolentCrimesPerPop"])
X = np.asarray(df.drop("ViolentCrimesPerPop", axis=1))
X.shape
(1994, 99)
y.shape
(1994,)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print("Original")
lasso_model_original = Lasso().fit(X_train,y_train)
lasso_prediction_original = lasso_model_original.predict(X_test)
print("R2 score: ", r2_score(y_test,lasso_prediction_original))
Original
R2 score:  -0.009519321319114482
print("Manual")
lasso_model_manual = Lasso(alpha = 0.01).fit(X_train,y_train)
lasso_prediction_manual = lasso_model_manual.predict(X_test)
print("R2 score: ", r2_score(y_test,lasso_prediction_manual))
Manual
R2 score:  0.5785625428598712
lasso_grid_search = {'alpha': list(np.linspace(0, 0.01, 6, dtype=float))}
lasso_model_grid = GridSearchCV(estimator=Lasso(), param_grid = lasso_grid_search,
cv=5, n_jobs=-1, return_train_score = True)
lasso_model_grid.fit(X_train, y_train)
print("Grid")
print("Best alpha: ", lasso_model_grid.best_estimator_.alpha)
lasso_prediction_grid = lasso_model_grid.best_estimator_.predict(X_test)
print("R2 score: ", r2_score(y_test,lasso_prediction_grid))
Grid
Best alpha:  0.002
R2 score:  0.6474102674847034
results = pd.DataFrame(lasso_model_grid.cv_results_)
results[results["rank_test_score"]<=5]["mean_train_score"]
0    0.708958
1    0.651444
2    0.629314
3    0.606781
4    0.580738
Name: mean_train_score, dtype: float64
results[results["rank_test_score"]<=5]["mean_test_score"]
0    0.636777
1    0.636954
2    0.620472
3    0.600450
4    0.577379
Name: mean_test_score, dtype: float64
It's interesting that the best estimator has an alpha very close to 0. Let's check the LinearRegression model and the Lasso(alpha=0) model.
print("Linear Regression")
print("R2 score: ", r2_score(y_test,LinearRegression().fit(X_train,y_train).predict(X_test)))
print("Lasso(alpha=0)")
print("R2 score: ", r2_score(y_test,Lasso(alpha=0).fit(X_train,y_train).predict(X_test)))
Linear Regression
R2 score:  0.6342758768190673
Lasso(alpha=0)
R2 score:  0.633885506497719
<ipython-input-21-41a5025577f9>:4: UserWarning: With alpha=0, this algorithm does not converge well. You are advised to use the LinearRegression estimator
print("R2 score: ", r2_score(y_test,Lasso(alpha=0).fit(X_train,y_train).predict(X_test)))
/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_coordinate_descent.py:529: UserWarning: Coordinate descent with no regularization may lead to unexpected results and is discouraged.
model = cd_fast.enet_coordinate_descent(
/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_coordinate_descent.py:529: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 11.337931996365818, tolerance: 0.007610036823970038
model = cd_fast.enet_coordinate_descent(
Both scores are very close to the grid-search result, but neither is better.
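Instead of hand-building an alpha grid, scikit-learn's LassoCV selects alpha by cross-validation along an automatically generated regularization path. A minimal sketch on synthetic data (the `X_demo`/`y_demo` names are illustrative stand-ins for the notebook's variables):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for X_train / y_train.
X_demo, y_demo = make_regression(n_samples=500, n_features=50, noise=5.0,
                                 random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33,
                                          random_state=42)
# LassoCV fits a whole path of alphas and keeps the one with the best CV score.
model = LassoCV(cv=5, random_state=42).fit(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
```

The selected value is available as `model.alpha_`, which avoids manually chasing alphas near 0 with GridSearchCV.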
print("Original")
dt_model_original = DecisionTreeRegressor().fit(X_train,y_train)
dt_prediction_original = dt_model_original.predict(X_test)
print("R2 score: ", r2_score(y_test,dt_prediction_original))
Original R2 score: 0.19931379923104697
print("Manual")
dt_model_manual = DecisionTreeRegressor(min_samples_leaf = 2 ,ccp_alpha = 0).fit(X_train,y_train)
dt_prediction_manual = dt_model_manual.predict(X_test)
print("R2 score: ", r2_score(y_test,dt_prediction_manual))
Manual R2 score: 0.24974542261008392
dt_grid_search = {'min_samples_leaf': list(np.linspace(20, 50, 6, dtype=int)),
'ccp_alpha': list(np.linspace(0, 1, 6, dtype=float))}
dt_model_grid = GridSearchCV(estimator=DecisionTreeRegressor(), param_grid = dt_grid_search,
verbose = 5,cv=5, n_jobs=-1, return_train_score = True)
dt_model_grid.fit(X_train, y_train)
print("Grid")
print("Best min_samples_leaf: ", dt_model_grid.best_estimator_.min_samples_leaf)
print("Best ccp_alpha: ", dt_model_grid.best_estimator_.ccp_alpha)
dt_prediction_grid = dt_model_grid.best_estimator_.predict(X_test)
print("R2 score: ", r2_score(y_test,dt_prediction_grid))
Fitting 5 folds for each of 36 candidates, totalling 180 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done  96 tasks      | elapsed:    2.4s
Grid
Best min_samples_leaf:  32
Best ccp_alpha:  0.0
R2 score:  0.5171299268181065
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed: 4.2s finished
results = pd.DataFrame(dt_model_grid.cv_results_)
results[results["rank_test_score"]<=5]["mean_train_score"]
0    0.733943
1    0.707809
2    0.692256
3    0.675659
4    0.662603
Name: mean_train_score, dtype: float64
results[results["rank_test_score"]<=5]["mean_test_score"]
0    0.547835
1    0.535483
2    0.561885
3    0.550297
4    0.550578
Name: mean_test_score, dtype: float64
The best min_samples_leaf (the minimum number of observations per leaf) is 32 and the best ccp_alpha is 0.0. The R2 score is worse than that of lasso regression; it might improve with further hyperparameter tuning.
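For broader tuning, RandomizedSearchCV (imported at the top but unused so far) samples hyperparameter combinations instead of exhaustively enumerating a grid, which scales better as the search space grows. A sketch on synthetic stand-in data:

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data.
X_demo, y_demo = make_regression(n_samples=400, n_features=20, noise=10.0,
                                 random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33,
                                          random_state=0)
# Sample 20 random combinations from the distributions instead of a full grid.
search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions={"min_samples_leaf": randint(2, 51),
                         "max_depth": randint(2, 11)},
    n_iter=20, cv=5, random_state=0, n_jobs=-1)
search.fit(X_tr, y_tr)
best_leaf = search.best_params_["min_samples_leaf"]
```

The same `best_estimator_` / `cv_results_` attributes are available afterwards, so the inspection code above works unchanged.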
print("Original")
rf_model_original = RandomForestRegressor().fit(X_train,y_train)
rf_prediction_original = rf_model_original.predict(X_test)
print("R2 score: ", r2_score(y_test,rf_prediction_original))
Original R2 score: 0.6211791340298725
print("Manual")
rf_model_manual = RandomForestRegressor(n_estimators = 500, min_samples_leaf = 5, max_samples = 150).fit(X_train,y_train)
rf_prediction_manual = rf_model_manual.predict(X_test)
print("R2 score: ", r2_score(y_test,rf_prediction_manual))
Manual R2 score: 0.6351480218214277
rf_grid_search = {'max_samples': list(np.linspace(0, 500, 6, dtype=int))}
rf_model_grid = GridSearchCV(estimator=RandomForestRegressor(n_estimators = 500, min_samples_leaf = 5), param_grid = rf_grid_search,
verbose = 5,cv=5, n_jobs=-1,return_train_score = True)
rf_model_grid.fit(X_train, y_train)
print("Grid")
print("Best max_samples: ", rf_model_grid.best_estimator_.max_samples)
rf_prediction_grid = rf_model_grid.best_estimator_.predict(X_test)
print("R2 score: ", r2_score(y_test,rf_prediction_grid))
Fitting 5 folds for each of 6 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done  22 out of  30 | elapsed:  1.2min remaining:   25.2s
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.9min finished
Grid
Best max_samples:  400
R2 score:  0.6408589350000655
results = pd.DataFrame(rf_model_grid.cv_results_)
results[results["rank_test_score"]<=5]["mean_train_score"]
1    0.678127
2    0.725030
3    0.759660
4    0.786430
5    0.808287
Name: mean_train_score, dtype: float64
results[results["rank_test_score"]<=5]["mean_test_score"]
1    0.638505
2    0.650088
3    0.654067
4    0.657975
5    0.657198
Name: mean_test_score, dtype: float64
There is some improvement, though also some overfitting.
print("Original")
gb_model_original = GradientBoostingRegressor().fit(X_train,y_train)
gb_prediction_original = gb_model_original.predict(X_test)
print("R2 score: ", r2_score(y_test,gb_prediction_original))
Original R2 score: 0.6144166573330152
print("Manual")
gb_model_manual = GradientBoostingRegressor(learning_rate = 0.01, n_estimators = 200, max_depth = 5).fit(X_train,y_train)
gb_prediction_manual = gb_model_manual.predict(X_test)
print("R2 score: ", r2_score(y_test,gb_prediction_manual))
Manual R2 score: 0.6036016091943279
gb_grid_search = {'max_depth': list(np.linspace(1, 6, 6, dtype=int)),
'learning_rate': [0.01, 0.05, 0.1],
'n_estimators': list(np.linspace(100, 300, 4, dtype=int))}
gb_model_grid = GridSearchCV(estimator=GradientBoostingRegressor(), param_grid = gb_grid_search,
verbose = 5,cv=5, n_jobs=-1, return_train_score = True)
gb_model_grid.fit(X_train, y_train)
print("Grid")
print("Best max_depth: ", gb_model_grid.best_estimator_.max_depth)
print("Best learning_rate: ", gb_model_grid.best_estimator_.learning_rate)
print("Best n_estimators: ", gb_model_grid.best_estimator_.n_estimators)
gb_prediction_grid = gb_model_grid.best_estimator_.predict(X_test)
print("R2 score: ", r2_score(y_test,gb_prediction_grid))
Fitting 5 folds for each of 72 candidates, totalling 360 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    1.5s
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:   43.1s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 272 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:  7.6min finished
Grid
Best max_depth:  4
Best learning_rate:  0.05
Best n_estimators:  233
R2 score:  0.6163982344940364
results = pd.DataFrame(gb_model_grid.cv_results_)
results[results["rank_test_score"]<=5]["mean_train_score"]
34    0.903759
35    0.925195
36    0.903737
37    0.939147
38    0.961856
Name: mean_train_score, dtype: float64
results[results["rank_test_score"]<=5]["mean_test_score"]
34    0.652836
35    0.652678
36    0.652679
37    0.653724
38    0.653903
Name: mean_test_score, dtype: float64
There is a lot of overfitting.
The best results were achieved by the random forest regressor.
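The per-model comparison above could also be done in one pass with cross_validate (imported at the top but unused), which reports mean train and test R2 for each candidate. A sketch on synthetic stand-in data:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the training data.
X_demo, y_demo = make_regression(n_samples=300, n_features=20, noise=10.0,
                                 random_state=1)
models = {"lasso": Lasso(alpha=0.01),
          "random_forest": RandomForestRegressor(n_estimators=50, random_state=1),
          "gradient_boosting": GradientBoostingRegressor(random_state=1)}
rows = {}
for name, model in models.items():
    cv = cross_validate(model, X_demo, y_demo, cv=5, scoring="r2",
                        return_train_score=True)
    rows[name] = {"train_r2": cv["train_score"].mean(),
                  "test_r2": cv["test_score"].mean()}
summary = pd.DataFrame(rows).T  # one row per model
```

Comparing `train_r2` against `test_r2` in the resulting table makes the overfitting gaps noted above visible at a glance.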